Asymptotic optimality of a 
Abstract: In this article we study the asymptotic predictive optimality of 
a model selection criterion based on the cross-validatory predictive density, 
already available in the literature. For a dependent variable and associated 
explanatory variables, we consider a class of linear models as approximations 
to the true regression function. One selects a model among these using the 
criterion under study and predicts a future replicate of the dependent variable 
by an optimal predictor under the chosen model. We show that for squared 
error prediction loss, this scheme of prediction performs asymptotically as well 
as an oracle, where the oracle here refers to a model selection rule which 
minimizes this loss if the true regression were known. 
1. Introduction 

The ultimate goal of modeling in any scientific or sociological investigation is to 
discover the underlying regular pattern or phenomenon, if any, which controls the 
data generating mechanism. Although it is almost impossible to imagine that a 
single model or combinations of a handful will fully capture the intricate functioning 
of nature or sociological issues, one can always hope to be able to come close. Given 
a choice of several models and a set of data, a popular method is to choose the model 
which explains or fits the given data best (in some well-defined sense). However, 
it is of prime importance that any model that is chosen should be able to predict 
future observations from the same experiment or process reasonably well and that 
it does not merely fit the observed data. This is the purpose of predictive model 
selection. 
One of the most prominent approaches to predictive model selection is
cross-validation (see [17]) and variants thereof. As the name cross-validation suggests, 
parameters of the population are estimated under each model by using a part of 
the data (the "estimation set"), while the rest of the data (the "validation set") 
are predicted using the estimates based on the first group. This is done repeatedly 
by using "validation sets" comprising different parts of the data, e.g., the whole 
data could simply be divided into 10 disjoint parts, each part consisting of an equal 
number of observations and predicted using the rest. If, for a particular model, 
such predictions match best with the actually observed values, i.e., if the average 
prediction error is the smallest for it among all the candidate models, it is selected. 
Optimality properties of classical cross-validatory techniques have been studied, 
e.g., in [12] and [16]. 
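
A minimal sketch of the 10-fold scheme just described, in Python with NumPy. The function names (`kfold_cv_error`, `select_by_cv`), the use of ordinary least squares as the estimator, and the encoding of each candidate model as a tuple of column indices are illustrative assumptions and are not taken from the cited papers.

```python
import numpy as np

def kfold_cv_error(X, y, n_folds=10, seed=0):
    """Average squared prediction error of least squares over disjoint validation sets."""
    n = len(y)
    # Randomly split the indices into n_folds disjoint validation sets.
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    total = 0.0
    for val in folds:
        est = np.setdiff1d(np.arange(n), val)           # estimation set
        beta, *_ = np.linalg.lstsq(X[est], y[est], rcond=None)
        total += np.sum((y[val] - X[val] @ beta) ** 2)  # error on the validation set
    return total / n

def select_by_cv(X, y, models, n_folds=10, seed=0):
    """Return the candidate model (tuple of column indices) with smallest CV error."""
    errors = {m: kfold_cv_error(X[:, list(m)], y, n_folds, seed) for m in models}
    return min(errors, key=errors.get), errors
```

Calling `select_by_cv(X, y, [(0,), (0, 1)])` compares the model using only the first regressor with the model using the first two; the same seed is used for every model so that all candidates are scored on identical folds.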

In the Bayesian literature, several approaches to model selection have been stud- 
ied with the predictive aspect in mind; see, e.g., [1, 4, 5, 8, 9, 10, 13, 14]. The purpose 
of this paper is to study the predictive properties of a model selection criterion (see 
(1.2) below) based on the average of the (log) cross-validatory predictive densities 
(see (1.1) below) and already available in the literature. Different types of averages 
(e.g. arithmetic mean, (log) geometric mean) of cross-validatory predictive densities 
have been studied by several authors ([2], [3], [5], [9] and [14]). Chakrabarti and 
Ghosh [5] considered an average with respect to disjoint validation sets and studied 
what should be the optimal proportion of the sample kept for validation in large 
sample sizes, for the selection of a model closest to the true model (in terms of 
Kullback-Leibler divergence), and for the selection of the more parsimonious model 
if two models are equidistant from the truth. Using squared error prediction loss, 
we show that model selection using criterion (1.2) has an optimality property in 
predicting a future replicate of the dependent variable (for fixed values of the in- 
dependent variables), when the true regression is being approximated by a class 
of candidate linear models. The proofs of the optimality results partly use some 
general techniques of Li [12] which were later adopted in [16]. 

In the Bayesian setup, the ordinary predictive density under a model is defined as 
the integral of the likelihood function of the observed data with respect to the prior 
distribution of the parameters under the model. Between two competing models, 
the one having a larger predictive density for the given data seems to be the more 
appropriate description of the unknown data generating process. In non-subjective 
Bayesian analysis, it is common to use noninformative priors for the parameters 
which are typically improper and defined only up to unknown multiplicative con- 
stants. In such situations, use of the ordinary predictive density as a model selection 
criterion will be inappropriate. To get rid of this difficulty, one updates the improper 
prior by getting a proper posterior based on part of the data (called the training 
sample) and then integrates the likelihood function of the rest of the data with 
respect to this posterior, thus giving the cross-validatory predictive density. This 
is like getting the predictive distribution of part of the data using information ob- 
tained from the rest of it. This method of obtaining a cross-validatory predictive 
density can also be used when one puts a proper prior on the parameters of the 
model. The cross-validatory predictive density can then be used to get pseudo-Bayes 
Factors, after appropriate averaging with respect to the different possible choices of 
the training sample. This line of thought owes its origin to Geisser [7] and Geisser 
and Eddy [8] and came to prominence through what are referred to as partial Bayes 
Factors or Intrinsic Bayes Factors ([2], [3], [9], [11] and [15]). 

In the next few paragraphs, we describe our setup and the model selection cri- 
terion we study. We follow the notations of Shao [16]. 
Let $y_n = (y_1, \ldots, y_n)'$ be a vector of observations on the dependent (response) 
variable and let $X_n = (x_1', \ldots, x_n')'$ be an $n \times p_n$ matrix of explanatory variables 
(which are potentially responsible for the variability in the $y_i$'s), with $x_i$ associated 
with $y_i$. Let $\mu_n$ denote $E(y_n \mid X_n)$, the (unknown) average value of the response 
variable given the values of the explanatory variables. We further assume that given 
$X_n$, $e_n = y_n - \mu_n$ has mean vector $0$ and the components of $e_n$ are independent 
with common variance $\sigma^2$, which could be known or unknown. We are interested 
in capturing the functional relationship, if any, between $\mu_n$ and $X_n$ which will be 
most suitable for predictive purposes. We restrict our search within a class of normal 
linear models. Our model space, denoted $\mathcal{A}_n$, is indexed by $\alpha$, where each $\alpha$ consists 
of a subset of size $p_n(\alpha)$ ($1 \le p_n(\alpha) \le p_n$) of $\{1, 2, \ldots, p_n\}$ and the true mean $\mu_n$ 
is assumed to be linearly related to the corresponding explanatory variables. More 
specifically, under model $\alpha \in \mathcal{A}_n$, $y_n \sim N(\mu_n(\alpha) = X_n(\alpha)\beta_n(\alpha), \sigma^2 I_n)$ where 
$X_n(\alpha)$ is the submatrix of $X_n$ consisting of the $p_n(\alpha)$ columns specified by $\alpha$ and 
$\beta_n(\alpha) \in \mathbb{R}^{p_n(\alpha)}$. A Bayesian puts a prior on the unknown parameters within each 
model. We consider standard non-subjective priors (see e.g., [1]) given by 

$\pi_\alpha(\beta_n(\alpha)) \propto 1$ if $\sigma^2$ is known, and 

$\pi_\alpha(\beta_n(\alpha), \sigma^2) \propto \frac{1}{\sigma^2}$ if $\sigma^2$ is unknown. 

Consider, for example, the case with $\sigma^2$ unknown. Let $\pi_\alpha((\beta_n(\alpha), \sigma^2) \mid y_{k+1}, \ldots, y_n)$ 
denote the posterior distribution of the parameters under the model given the 
observations $(y_{k+1}, \ldots, y_n)$. The cross-validatory predictive density of $(y_1, \ldots, y_k)$ given 
$(y_{k+1}, \ldots, y_n)$ under model $\alpha$, denoted by $f_\alpha(y_1, \ldots, y_k \mid y_{k+1}, \ldots, y_n)$, 
is given by 

(1.1) $\int f_{\beta_n(\alpha), \sigma^2}(y_1, \ldots, y_k)\, \pi_\alpha((\beta_n(\alpha), \sigma^2) \mid y_{k+1}, \ldots, y_n)\, d\beta_n(\alpha)\, d\sigma^2,$ 

where $f_{\beta_n(\alpha), \sigma^2}(y_1, \ldots, y_k)$ denotes the density of the $k$-dimensional normal vector, 
with mean vector given by the first $k$ components of $\mu_n(\alpha)$ and variance-covariance 
matrix $\sigma^2 I_k$, evaluated at $y_1, \ldots, y_k$. Similarly, the predictive density of any subset 
$(y_{t_1}, \ldots, y_{t_k})$ of $y_n$, given the rest of the components of $y_n$ under this model can be 
calculated, where $(t_1, \ldots, t_k)$ denotes a subset of $(1, \ldots, n)$. Since a good criterion 
should not depend too much on the choice of the training sample, we consider 
the geometric mean of the cross-validatory predictive densities thus obtained by 
varying the choice of the training sample. The ratio of such geometric means for 
two models is precisely the Geometric Intrinsic Bayes Factor ([2], [3]). For model 
a, the criterion which we intend to study equals the logarithm of this geometric 
mean. Thus if we consider a total of r training samples, this logarithm is given by 

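(1.2) $CV(\alpha) = \frac{1}{r} \sum_{i=1}^{r} \log f_\alpha\big(y_{t_{1i}}, \ldots, y_{t_{ki}} \mid \{y_t : t \notin (t_{1i}, \ldots, t_{ki})\}\big),$ 

where $(y_{t_{1i}}, \ldots, y_{t_{ki}})$ is the set of $y$ observations not included in the $i$-th training 
sample. One selects the model $\hat\alpha_n \in \mathcal{A}_n$ which maximizes $CV(\alpha)$. 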
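
To make the criterion concrete, the following sketch evaluates it in the simpler case where $\sigma^2$ is known and the prior on the regression coefficients is flat. Under those assumptions the posterior given a training sample is normal, so the cross-validatory predictive density of the held-out observations is a multivariate normal density. The function names `log_pred_density` and `cv_criterion`, and the representation of training samples as index arrays, are hypothetical choices made for illustration.

```python
import numpy as np

def log_pred_density(X_tr, y_tr, X_val, y_val, sigma2):
    """log f(y_val | y_tr) under a flat prior on beta with sigma2 known.

    The posterior of beta given y_tr is N(b, sigma2 * G^{-1}) with G = X_tr'X_tr,
    so y_val | y_tr ~ N(X_val b, sigma2 * (I + X_val G^{-1} X_val')).
    """
    G = X_tr.T @ X_tr
    b = np.linalg.solve(G, X_tr.T @ y_tr)
    cov = sigma2 * (np.eye(len(y_val)) + X_val @ np.linalg.solve(G, X_val.T))
    resid = y_val - X_val @ b
    _, logdet = np.linalg.slogdet(cov)
    quad = resid @ np.linalg.solve(cov, resid)
    return -0.5 * (len(y_val) * np.log(2.0 * np.pi) + logdet + quad)

def cv_criterion(X, y, cols, training_sets, sigma2):
    """Average log cross-validatory predictive density over the training samples."""
    Xa, idx = X[:, list(cols)], np.arange(len(y))
    vals = []
    for tr in training_sets:
        val = np.setdiff1d(idx, tr)   # observations held out of this training sample
        vals.append(log_pred_density(Xa[tr], y[tr], Xa[val], y[val], sigma2))
    return float(np.mean(vals))
```

The model with the largest value of `cv_criterion` would be selected. Each training sample must contain at least as many rows as the model has columns, so that `G` is nonsingular.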

Once a model is thus selected, we use the mean of the predictive distribution 
of $y_n^{new}$, given the observed $y_n$ under the selected model, as the predictor 
for a future replicate $y_n^{new}$ of the response variable for the same value $X_n$ of 
the explanatory variables. An easy calculation shows that this turns out to be 
the least squares estimate $X_n(\hat\alpha_n)\hat\beta_n(\hat\alpha_n)$, where $X_n(\alpha)\hat\beta_n(\alpha) = P_n(\alpha)y_n$ and 
$P_n(\alpha) = X_n(\alpha)(X_n(\alpha)'X_n(\alpha))^{-1}X_n(\alpha)'$ is the usual projection matrix. 
Our goal is to evaluate this prediction scheme under the true regression using 
squared error prediction loss. Under the true $\mu_n$, the future replicate $y_n^{new}$ 
will be independent of the original observations $y_n$. The quality of any predictor 
$\delta(y_n)$ of $y_n^{new}$ based on $y_n$ can be evaluated by the average prediction error 
$E_\mu\big(\frac{1}{n}\|y_n^{new} - \delta(y_n)\|^2\big)$, where $E_\mu$ denotes expectation with respect to the 
joint distribution of $(y_n^{new}, y_n)$ when $\mu_n$ is the true unknown mean. This expectation 
will be small if, for any fixed $y_n$, $E_\mu\big(\frac{1}{n}\|y_n^{new} - \delta(y_n)\|^2 \mid y_n\big)$ is also small. 
As observed before, the predictor $\delta(y_n)$ we want to evaluate is the same as the 
least squares predictive estimate of $y_n^{new}$ under the chosen model $\hat\alpha_n$. Now note 
that for any given fixed model $\alpha$, the least squares predictive estimate is given by 
$\delta(y_n) = \delta(y_n)(\alpha) = \hat\mu_n(\alpha) = X_n(\alpha)\hat\beta_n(\alpha)$. Simple algebra shows that the above 
conditional expectation is, up to a constant which does not depend on $\alpha$, equal to 

(1.3) $L_n(\alpha) = \frac{1}{n}\|\mu_n - \hat\mu_n(\alpha)\|^2.$ 



Hence the conditional expectation will be minimized for a certain $\alpha$ if $L_n(\alpha)$ is 
minimized. If we knew the true $\mu_n$, we could find the model which minimizes this 
$L_n(\alpha)$ for each $y_n$. We shall call this the oracle model, denoted $\alpha_n^L$. The best any 
procedure can achieve is to do as well as the oracle in the limit in terms of the loss 
as the sample size grows to infinity. 
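
The oracle can be computed only in a simulation, where the true mean is available. The sketch below assumes that the loss is the averaged squared distance between the true mean and the least squares fit, $L_n(\alpha) = \|\mu_n - P_n(\alpha)y_n\|^2/n$; the function names and the encoding of models as tuples of column indices are hypothetical.

```python
import numpy as np

def projection(Xa):
    """Projection matrix onto the column space of Xa (assumed full column rank)."""
    return Xa @ np.linalg.solve(Xa.T @ Xa, Xa.T)

def loss(mu, y, Xa):
    """L_n(alpha) = ||mu - P(alpha) y||^2 / n for the least squares predictor."""
    return float(np.sum((mu - projection(Xa) @ y) ** 2) / len(y))

def oracle_model(mu, y, X, models):
    """Model minimizing L_n over the candidates; computable only when mu is known."""
    losses = {m: loss(mu, y, X[:, list(m)]) for m in models}
    return min(losses, key=losses.get), losses
```

In a simulation study one would compare `losses[selected]` for the model chosen by the criterion with `losses[best]` for the oracle; the optimality results concern the ratio of these two quantities.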

We show in the following sections of this article that under certain conditions, 
maximizing $CV(\alpha)$ with respect to $\alpha$ is asymptotically equivalent to minimizing 
$L_n(\alpha)$. Using this fact it is shown that the ratio of $L_n(\alpha_n^L)$ to $L_n(\hat\alpha_n)$ tends to 1 
in probability, thereby establishing the optimum asymptotic behavior of criterion 
(1.2) in the problem of prediction of a set of future observations. 

In Sections 2 and 3 we consider the case where the true model is not in the model 
space - the proposed models are only approximations to the truth. In Section 2 we 
consider the case when $\sigma^2$ is known. We show that under certain assumptions, the 
model selection procedure under study performs as well as the oracle asymptotically 
in the sense that the ratio of their losses tends to one in probability. In Section 3, 
we consider the more realistic situation when $\sigma^2$ is unknown. Under appropriate 
conditions it is shown that this procedure also achieves the oracle asymptotically in 
this case. As a validation of this method, we next consider in Section 4 the question 
of whether, under the assumption that the true model is indeed included in the 
model space, we do equally well in terms of hitting the oracle loss asymptotically. 
It is shown that this model selection procedure chooses the correct model with 
smallest dimension with probability tending to one in addition to being asymptot- 
ically optimal in terms of hitting the oracle. Some concluding remarks are made in 
Section 5. Technical proofs of most of the results are given in the Appendix. 

For notational simplicity we write $y$, $\mu$, $e$, $X(\alpha)$, $\beta(\alpha)$ and $P(\alpha)$ in place of $y_n$, 
$\mu_n$, $e_n$, $X_n(\alpha)$, $\beta_n(\alpha)$ and $P_n(\alpha)$ respectively, dropping the suffix $n$ for the rest of 
the paper. 



2. Basic results: case with $\sigma^2$ known 

In this section we take the "model false" point of view that the models are only 
approximations to the truth but none of them is actually true. We show that under 
certain conditions, the model selection procedure under study is asymptotically 
optimal in the sense of performing as well as the oracle defined above. 
As described in the introduction, the model selection criterion under consideration 
is an average of the cross-validatory predictive density 

$f_\alpha(y_1, \ldots, y_k \mid y_{k+1}, \ldots, y_n)$ 

under model $\alpha$, over suitable choices of the "training sample" $\{y_{k+1}, \ldots, y_n\}$. We 
do not recommend here any particular choice of the training samples; our results 
hold as long as each $y_i$, $1 \le i \le n$, appears in the same number of training samples 
chosen (which will be assumed throughout the paper). 
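
One simple design satisfying the balance requirement, given here only as an illustration, takes the training samples to be disjoint blocks of equal size, so that every observation appears in exactly one training sample. The helper names below are hypothetical.

```python
import numpy as np

def block_training_sets(n, train_size):
    """Disjoint blocks of indices used as training samples (each index appears once)."""
    if n % train_size != 0:
        raise ValueError("train_size must divide n for a balanced design")
    return [np.arange(s, s + train_size) for s in range(0, n, train_size)]

def appearance_counts(n, training_sets):
    """Number of training samples in which each observation index appears."""
    counts = np.zeros(n, dtype=int)
    for tr in training_sets:
        counts[tr] += 1
    return counts
```

For any balanced design, averaging a sum of squares over the training samples recovers the fraction $(n-k)/n$ of the full sum of squares. That is the property used when sums over training samples are compared with sums over the whole data set.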

Let $y_i$, $i = 1, \ldots, r$ be the $r$ training samples (each of size $n - k$). For each $y_i$, let 
$\mu_i$ and $e_i$ be the subvectors of $\mu$ and $e$ corresponding to the labels of the components 
of $y_i$ and $X_i(\alpha)$ be the submatrix of $X(\alpha)$ consisting of the corresponding rows of 
it. Also, let $\hat\beta_i(\alpha) = [X_i'(\alpha)X_i(\alpha)]^{-1}X_i'(\alpha)y_i$, $P_i(\alpha) = X_i(\alpha)[X_i'(\alpha)X_i(\alpha)]^{-1}X_i'(\alpha)$, 
$i = 1, \ldots, r$. It will be assumed throughout that $(n - k) \to \infty$ and $X_i'(\alpha)X_i(\alpha)$ is 
nonsingular for each $i$ and $\alpha$. With the standard non-subjective prior $\pi(\beta(\alpha)) =$ 
constant, we have a closed form expression for the cross-validatory predictive density. 
An alternative equivalent criterion, which is to be minimized with respect to 
$\alpha$, is 

$\Gamma(\alpha) = \frac{1}{n}(y - X(\alpha)\hat\beta(\alpha))'(y - X(\alpha)\hat\beta(\alpha))$ 

$\qquad - \frac{1}{r}\sum_{i=1}^{r} \frac{1}{n}(y_i - X_i(\alpha)\hat\beta_i(\alpha))'(y_i - X_i(\alpha)\hat\beta_i(\alpha))$ 

(2.1) $\qquad + \frac{\sigma^2}{nr}\sum_{i=1}^{r} \log\frac{|X'(\alpha)X(\alpha)|}{|X_i'(\alpha)X_i(\alpha)|}.$ 


Note that $\Gamma(\alpha)$ is equal to the negative of the criterion (1.2) up to a positive 
multiplicative constant and an additive constant, neither of which depends on $\alpha$. 
We will prove that minimization of $\Gamma(\alpha)$ is equivalent to minimization 
of the loss $L_n(\alpha)$ (defined in (1.3)) in an appropriate asymptotic sense and this 
will lead to the desired asymptotic (predictive) optimality of the criterion under 
consideration. 

Note that the loss $L_n(\alpha)$ defined in (1.3) can be written as 

$nL_n(\alpha) = n\Delta_n(\alpha) + e'P(\alpha)e$ 

where $n\Delta_n(\alpha) = \mu'(I - P(\alpha))\mu$, and let 

$nR_n(\alpha) = E(nL_n(\alpha)) = n\Delta_n(\alpha) + \sigma^2 p_n(\alpha).$ 
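
For completeness, the decomposition can be checked in a few lines, assuming (as in Section 1) that $L_n(\alpha)$ is the averaged squared distance between $\mu$ and the least squares fit $P(\alpha)y$. This is a sketch of the standard argument, not a quotation from [12] or [16].

```latex
\begin{align*}
nL_n(\alpha) &= \|\mu - P(\alpha)(\mu + e)\|^2
  = \|(I - P(\alpha))\mu - P(\alpha)e\|^2 \\
 &= \mu'(I - P(\alpha))\mu + e'P(\alpha)e
  \quad\text{since } (I - P(\alpha))P(\alpha) = 0, \\
E\{e'P(\alpha)e\} &= \sigma^2\,\mathrm{tr}\,P(\alpha) = \sigma^2 p_n(\alpha).
\end{align*}
```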

One of the key assumptions under which we prove our results is the following 
condition ([12], [16]): 

(2.2) $\sum_{\alpha \in \mathcal{A}_n} \frac{1}{(nR_n(\alpha))^m} \to 0$ 

for some positive integer $m$ for which $E(e_1^{4m}) < \infty$. We also assume 

(2.3) $\frac{p_n \lambda_n}{\min_{\alpha \in \mathcal{A}_n} nR_n(\alpha)} \to 0,$ 

where $\lambda_n = \log(n/(n - k))$. 
For certain remarks justifying these assumptions, see [12] and [16]. In particular, 
it is argued in these papers using several concrete examples, that condition (2.2) 
is a natural one when the dimension $p_n$ of the largest model grows with sample 
size. Also, if $p_n$ remains bounded, $nR_n(\alpha)$ is expected to go to $\infty$ for all $\alpha$ as 
the sample size increases, if the candidate models are separated from the truth. 
That $\min_\alpha nR_n(\alpha) \to \infty$ is assumption A.3' of Li [12] and, as remarked therein, it 
is a quite reasonable assumption if $p_n$ grows with $n$. Condition (2.3) requires that 
$\min_\alpha nR_n(\alpha) \to \infty$ at a suitable rate. Under condition (3.3) below ([16], condition 
(2.5)), (2.3) holds if $(p_n \lambda_n)/n \to 0$. 

It is important to note that we also need to assume $(n - k)/n \to 0$ to prove our 
results (see e.g. (6.10)). This addresses an important question about the required 
size of the training sample. We, however, do not claim that it is a necessary condition 
for asymptotic predictive optimality. 

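We now consider the criterion $\Gamma(\alpha)$ as defined in (2.1). Since $X(\alpha)\hat\beta(\alpha) = P(\alpha)y$, 

$\frac{1}{n}(y - X(\alpha)\hat\beta(\alpha))'(y - X(\alpha)\hat\beta(\alpha))$ 

$\qquad = \frac{1}{n} y'(I - P(\alpha))y$ 

(2.4) $\qquad = \frac{1}{n}e'e + L_n(\alpha) - \frac{2}{n}e'P(\alpha)e + \frac{2}{n}e'(I - P(\alpha))\mu.$ 

Similarly, 

$\frac{1}{r}\sum_{i=1}^{r} \frac{1}{n}(y_i - X_i(\alpha)\hat\beta_i(\alpha))'(y_i - X_i(\alpha)\hat\beta_i(\alpha))$ 

$\qquad = \frac{1}{r}\sum_{i=1}^{r} \frac{1}{n} y_i'(I - P_i(\alpha))y_i$ 

$\qquad = \frac{n-k}{n^2}e'e + \frac{1}{nr}\sum_{i=1}^{r} \mu_i'(I - P_i(\alpha))\mu_i - \frac{1}{nr}\sum_{i=1}^{r} e_i'P_i(\alpha)e_i$ 

(2.5) $\qquad\quad + \frac{2}{nr}\sum_{i=1}^{r} e_i'(I - P_i(\alpha))\mu_i.$ 
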

We first state two auxiliary results. 
Lemma 2.1. Under conditions (2.2) and (2.3), 

$\frac{1}{n}(y - X(\alpha)\hat\beta(\alpha))'(y - X(\alpha)\hat\beta(\alpha)) = \frac{1}{n}e'e + L_n(\alpha) + o_p(L_n(\alpha))$ 

uniformly in $\alpha \in \mathcal{A}_n$. 

By saying $Z_n(\alpha) = o_p(L_n(\alpha))$ uniformly in $\alpha$, we mean $\max_\alpha |Z_n(\alpha)|/L_n(\alpha) \to_p 0$. 

Lemma 2.2. Suppose that conditions (2.2) and (2.3) hold and $(n - k)/n \to 0$. 
Then 

$\frac{1}{r}\sum_{i=1}^{r} \frac{1}{n}(y_i - X_i(\alpha)\hat\beta_i(\alpha))'(y_i - X_i(\alpha)\hat\beta_i(\alpha)) = \frac{n-k}{n^2}e'e + o_p(L_n(\alpha)),$ 

uniformly in $\alpha \in \mathcal{A}_n$. 
Proofs of Lemma 2.1 and Lemma 2.2 are given in the Appendix. 
In order to prove the main result of this section we need to assume another 
condition which is given below. 
Let 

(2.6) $a_{in}(\alpha) = \log\left[\frac{(n-k)^{p_n(\alpha)}\,|X'(\alpha)X(\alpha)|}{n^{p_n(\alpha)}\,|X_i'(\alpha)X_i(\alpha)|}\right].$ 

We assume 

(2.7) $\max_{\alpha \in \mathcal{A}_n} \frac{\left|\frac{1}{r}\sum_{i=1}^{r} a_{in}(\alpha)\right|}{nR_n(\alpha)} \to 0.$ 




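Remark 2.1. Let $x_1'(\alpha), \ldots, x_n'(\alpha)$ be the $n$ rows of $X(\alpha)$. If these $n$ rows are 
"similar", e.g., if they can be thought of as (independent) realizations of a random 
vector $x$ and $p_n$ is small compared to both $n - k$ and $n$, then 

$\left|\frac{X'(\alpha)X(\alpha)}{n}\right| = \left|\frac{1}{n}\sum_{j=1}^{n} x_j(\alpha)x_j'(\alpha)\right| \approx |E(xx')|$ 

and similarly 

$\left|\frac{X_i'(\alpha)X_i(\alpha)}{n-k}\right| \approx |E(xx')|.$ 

In this case, it follows that $a_{in}(\alpha) \approx 0$. In such a situation, assumption (2.7) seems 
to be quite reasonable. 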
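
The heuristic in Remark 2.1 is easy to check numerically. The sketch below reads (2.6) as $a_{in}(\alpha) = \log|X'(\alpha)X(\alpha)/n| - \log|X_i'(\alpha)X_i(\alpha)/(n-k)|$ and simulates rows as independent standard normal vectors; the function name `a_in`, the sample sizes and the distribution of the rows are all assumptions made for illustration.

```python
import numpy as np

def a_in(X_full, X_train):
    """log|X'X / n| minus log|X_i'X_i / (n - k)| for one training sample."""
    n, m = X_full.shape[0], X_train.shape[0]   # m = n - k rows in the training sample
    _, ld_full = np.linalg.slogdet(X_full.T @ X_full / n)
    _, ld_train = np.linalg.slogdet(X_train.T @ X_train / m)
    return ld_full - ld_train

rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 3))    # rows are i.i.d. draws of a random vector x
value = a_in(X, X[:5000])          # first 5000 rows as one training sample
```

With both sample sizes large relative to the number of columns, `value` is close to zero, because both scaled Gram matrices estimate the same second-moment matrix.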

Now note that (2.3) and (2.7) imply that the third term on the right hand 
side of (2.1) is also of the order $o_p(L_n(\alpha))$ uniformly in $\alpha \in \mathcal{A}_n$. Thus 

$\Gamma(\alpha) = \text{constant} + L_n(\alpha) + o_p(L_n(\alpha))$ uniformly in $\alpha \in \mathcal{A}_n$, 

which implies that minimization of $\Gamma(\alpha)$ is essentially equivalent to minimization of 
$L_n(\alpha)$ in an appropriate asymptotic sense, and we have the following result. 

Theorem 2.1. Suppose that conditions (2.2), (2.3) and (2.7) hold and $(n - k)/n \to 0$. 
Then we have the following results. 

(a) $\Gamma(\alpha) = \frac{k}{n^2}e'e + L_n(\alpha) + o_p(L_n(\alpha))$ uniformly in $\alpha \in \mathcal{A}_n$. 

(b) The model selection rule under study is asymptotically optimal in the sense 
that 

$\frac{L_n(\hat\alpha_n)}{\min_{\alpha \in \mathcal{A}_n} L_n(\alpha)} \to_p 1,$ 

where $\hat\alpha_n$ is as defined in Section 1. 

Proof of Theorem 2.1 is given in the Appendix. 



3. Case with $\sigma^2$ unknown 

We now consider the more realistic situation when the variance $\sigma^2$ is unknown. 
The standard non-subjective prior in this case is $\pi(\beta(\alpha), \sigma^2) \propto 1/\sigma^2$ under model 
$\alpha$. Interestingly, the results in this case follow from the basic results obtained in 
Section 2. We consider here the ("model false") setup and assumptions of Section 2. 



Optimality of a predictive approach to model selection 






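Let $y_i$, $i = 1, \ldots, r$ be the $r$ training samples chosen. The cross-validatory 
predictive density under model $\alpha$ for a training sample $y_i$ is given by 

$\frac{|X_i'(\alpha)X_i(\alpha)|^{1/2}\,[(y - X(\alpha)\hat\beta(\alpha))'(y - X(\alpha)\hat\beta(\alpha))]^{-n/2}}{|X'(\alpha)X(\alpha)|^{1/2}\,[(y_i - X_i(\alpha)\hat\beta_i(\alpha))'(y_i - X_i(\alpha)\hat\beta_i(\alpha))]^{-(n-k)/2}}$ 

up to a multiplicative constant. 

Our criterion (to be minimized with respect to $\alpha$), which is an average over the 
$r$ training samples, is given by 

(3.1) $\Gamma(\alpha) = \log[S(\alpha)] - \frac{n-k}{nr}\sum_{i=1}^{r} \log[S_i(\alpha)] + \frac{1}{nr}\sum_{i=1}^{r} \log\frac{|X'(\alpha)X(\alpha)|}{|X_i'(\alpha)X_i(\alpha)|},$ 

where $S(\alpha) = (y - X(\alpha)\hat\beta(\alpha))'(y - X(\alpha)\hat\beta(\alpha))$ and 
$S_i(\alpha) = (y_i - X_i(\alpha)\hat\beta_i(\alpha))'(y_i - X_i(\alpha)\hat\beta_i(\alpha))$. 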

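Note that $\Gamma(\alpha) = (k/n)\log(n\sigma^2) + \Gamma_1(\alpha)$ where 

(3.2) $\Gamma_1(\alpha) = \log\left[\frac{S(\alpha)}{n\sigma^2}\right] - \frac{n-k}{nr}\sum_{i=1}^{r} \log\left[\frac{S_i(\alpha)}{n\sigma^2}\right] + \frac{1}{nr}\sum_{i=1}^{r} a_{in}(\alpha) + \frac{1}{n}p_n(\alpha)\lambda_n,$ 

$a_{in}(\alpha)$ is as defined in (2.6) and $\lambda_n = \log(n/(n - k))$. Therefore, minimizing $\Gamma(\alpha)$ 
(with respect to $\alpha$) is equivalent to minimizing $\Gamma_1(\alpha)$. Let 

$u_n(\alpha) = \log\left[\frac{e'e}{n\sigma^2} + \frac{L_n(\alpha)}{\sigma^2}\right].$ 

In order to prove the asymptotic optimality of this method, we first note in Lemma 
3.1 below that $\Gamma_1(\alpha)$ is asymptotically equivalent to $u_n(\alpha)$ and this in turn implies 
the desired conclusion as stated in Theorem 3.1. We prove these results by invoking 
certain conditions which we describe below. 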

We first make the following assumption (see [16], condition (2.5)): 

(3.3) $\liminf_{n \to \infty} \min_{\alpha} \Delta_n(\alpha) > 0,$ 

where $\Delta_n(\alpha)$ is as defined in Section 2. This may be thought of as an identifiability 
condition on the models in the model space, as appears in the discussion of Mervyn 
Stone on [16]. We further assume that 



(3.4) 



■ log n 



0, £^ _> and - > is bounded, 

71 77 — ^ 



(3.5) $\frac{1}{nr}\sum_{i=1}^{r} a_{in}(\alpha) \to 0,$ 

and 

(3.6) $\sum_{i=1}^{r} \log(S_i) > 0$ 

with probability tending to 1, where $S_i$ is equal to $S_i(\alpha)$ with $\alpha$ as the full model, 
i.e., $\alpha = \{1, \ldots, p_n\}$. One can give sufficient conditions for (3.6) based on the 
relative magnitude of $r$ and $(n - k)$ as $n \to \infty$, to the effect that $r$ is not too large 
compared with $n - k$, which is the case for most practically implementable schemes. 
We, however, do not record the details here. The final results of this section are 
now stated below. 
Lemma 3.1. Under conditions (3.3)-(3.6), 

(3.7) $\Gamma_1(\alpha) = u_n(\alpha) + o_p(u_n(\alpha))$ uniformly in $\alpha$. 

Theorem 3.1. Under conditions (3.3)-(3.6), 

(3.8) $\frac{L_n(\alpha_n^L)}{L_n(\hat\alpha_n)} \to_p 1.$ 

Both Lemma 3.1 and Theorem 3.1 are proved in the Appendix. 



4. The "model true" case and consistency 

We now show that if some model in the model space is true, the model selection 
procedure under study chooses the correct model of the smallest dimension in ad- 
dition to being asymptotically optimal. Thus this procedure not only captures the 
truth but at the same time is as parsimonious as possible. Although the assumption 
of a true model may not seem to be very realistic, our result in this section provides 
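a validation of the method. We, however, consider only the simpler case when $\sigma^2$ 
is known. 

As in [16], let $\mathcal{A}_n^c \subset \mathcal{A}_n$ denote all the proposed models that are actually correct. 
Thus for $\alpha \in \mathcal{A}_n^c$, $\mu = X(\alpha)\beta(\alpha)$ for some $\beta(\alpha) \in \mathbb{R}^{p_n(\alpha)}$. In Section 2 we assumed 
that $\mathcal{A}_n^c$ is empty. It is important to note that all the results of Section 2 with $\mathcal{A}_n$ 
replaced by $\mathcal{A}_n - \mathcal{A}_n^c$ hold under the corresponding assumptions with $\mathcal{A}_n$ replaced 
by $\mathcal{A}_n - \mathcal{A}_n^c$. In particular, if 

(4.1) $\sum_{\alpha \in \mathcal{A}_n - \mathcal{A}_n^c} \frac{1}{(nR_n(\alpha))^m} \to 0$ 

for some positive integer $m$ for which $E(e_1^{4m}) < \infty$ and 

(4.2) $\frac{p_n \lambda_n}{\min_{\alpha \in \mathcal{A}_n - \mathcal{A}_n^c} nR_n(\alpha)} \to 0,$ 

with $\lambda_n = \log(n/(n - k))$, then 

(4.3) $\Gamma(\alpha) = \frac{k}{n^2}e'e + L_n(\alpha) + o_p(L_n(\alpha))$ 

uniformly in $\alpha \in \mathcal{A}_n - \mathcal{A}_n^c$. 

For $\alpha \in \mathcal{A}_n^c$, $(I - P(\alpha))\mu = 0$ and $(I - P_i(\alpha))\mu_i = 0$ $\forall i$. Therefore, from (2.1), 
(2.4) and (2.5) we have for $\alpha \in \mathcal{A}_n^c$ 

(4.4) $\Gamma(\alpha) = \frac{k}{n^2}e'e - \frac{1}{n}e'P(\alpha)e + \frac{1}{nr}\sum_{i=1}^{r} e_i'P_i(\alpha)e_i + \frac{\sigma^2}{nr}\sum_{i=1}^{r} \log\frac{|X'(\alpha)X(\alpha)|}{|X_i'(\alpha)X_i(\alpha)|}.$ 

Also $L_n(\alpha) = \frac{1}{n}e'P(\alpha)e$ for $\alpha \in \mathcal{A}_n^c$. 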
We now assume that 

(4.5) $\limsup_{n \to \infty} \sum_{\alpha \in \mathcal{A}_n^c} \frac{1}{(p_n(\alpha))^m} < \infty$ 

for some positive integer $m$ such that $E(e_1^{4m}) < \infty$ (condition (3.10) of Shao [16]), 
and 

(4.6) $\max_{\alpha \in \mathcal{A}_n^c} \frac{\left|\frac{1}{r}\sum_{i=1}^{r} a_{in}(\alpha)\right|}{p_n(\alpha)\lambda_n} \to 0$ 

with $\lambda_n = \log(n/(n - k))$ and $a_{in}(\alpha)$ as defined in (2.6). See Remark 2.1 in this context. 
Let $\alpha_n^c$ be the model $\alpha$ in $\mathcal{A}_n^c$ with smallest dimension. Using the above, we now 
have 

Proposition 4.1. Under conditions (4.1), (4.2), (4.5) and (4.6) 

(4.7) $\Gamma(\alpha) = \frac{k}{n^2}e'e + \frac{1}{n}\lambda_n\sigma^2 p_n(\alpha) + o_p\big(\frac{1}{n}\lambda_n\sigma^2 p_n(\alpha)\big)$ 

uniformly in $\alpha \in \mathcal{A}_n^c$, and 

(4.8) $\Gamma(\alpha_n^c) = \frac{k}{n^2}e'e + o_p(L_n(\alpha))$ uniformly in $\alpha \in \mathcal{A}_n - \mathcal{A}_n^c$. 
Proof of Proposition 4.1 is given in the Appendix. 

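Keeping in mind the above facts, we now proceed towards proving that this 
model selection rule chooses the most parsimonious correct model as claimed in 
Theorem 4.1 below. Towards this we first observe that (4.3) and (4.8) imply 

$\max_{\alpha \in \mathcal{A}_n - \mathcal{A}_n^c} \Big(\Gamma(\alpha_n^c) - \frac{k}{n^2}e'e\Big) \Big/ \Big(\Gamma(\alpha) - \frac{k}{n^2}e'e\Big) < 1$ 

with probability tending to 1. It then follows that 

(4.9) $P[\Gamma(\alpha_n^c) < \Gamma(\alpha)\ \forall \alpha \in \mathcal{A}_n - \mathcal{A}_n^c] \to 1.$ 
We now try to find some conditions under which 

(4.10) $P[\Gamma(\alpha_n^c) < \Gamma(\alpha)\ \forall \alpha \in \mathcal{A}_n^c,\ \alpha \ne \alpha_n^c] \to 1.$ 
Let $n[\Gamma(\alpha) - \Gamma(\alpha_n^c)] = Z_n(\alpha)$. It is enough to show that 

(4.11) $P[Z_n(\alpha) > 0\ \forall \alpha \in \mathcal{A}_n^c,\ \alpha \ne \alpha_n^c] \to 1.$ 
Now, 

$P[Z_n(\alpha) \le 0 \text{ for some } \alpha \in \mathcal{A}_n^c,\ \alpha \ne \alpha_n^c]$ 

$\qquad \le \sum_{\alpha \in \mathcal{A}_n^c,\ \alpha \ne \alpha_n^c} P[Z_n(\alpha) \le 0]$ 

(4.12) $\qquad \le \sum_{\alpha \in \mathcal{A}_n^c,\ \alpha \ne \alpha_n^c} P\big[|Z_n(\alpha) - E(Z_n(\alpha))| \ge E(Z_n(\alpha))\big].$ 
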
From (4.4) 

(4.13) $Z_n(\alpha) - E(Z_n(\alpha)) = \frac{1}{r}\sum_{i=1}^{r} e_i'[P_i(\alpha) - P_i(\alpha_n^c)]e_i - e'[P(\alpha) - P(\alpha_n^c)]e$ 
and $E(Z_n(\alpha))$ can be written as 

$\frac{1}{\sigma^2}E(Z_n(\alpha)) = [p_n(\alpha) - p_n(\alpha_n^c)]\lambda_n + \frac{1}{r}\sum_{i=1}^{r} [a_{in}(\alpha) - a_{in}(\alpha_n^c)]$ 

where $a_{in}(\alpha)$ is as defined in (2.6). If we assume 

(4.14) $\frac{1}{r}\sum_{i=1}^{r} [a_{in}(\alpha) - a_{in}(\alpha_n^c)] = o_p\big([p_n(\alpha) - p_n(\alpha_n^c)]\lambda_n\big)$ 

uniformly in $\alpha \in \mathcal{A}_n^c$, then 

(4.15) $\frac{1}{\sigma^2}E(Z_n(\alpha)) = [p_n(\alpha) - p_n(\alpha_n^c)]\lambda_n + o_p\big([p_n(\alpha) - p_n(\alpha_n^c)]\lambda_n\big)$ 

uniformly in $\alpha \in \mathcal{A}_n^c$. Noting that $P(\alpha) - P(\alpha_n^c)$ and $P_i(\alpha) - P_i(\alpha_n^c)$ are projection 
matrices and the first term on the right hand side of (4.13) can be expressed as 
$e'Me$ for some matrix $M$, and using Theorem 2 of Whittle [18] or inequality (6.2) 
of the Appendix we have 

$E|Z_n(\alpha) - E(Z_n(\alpha))|^{2m} \le \text{constant}\,[p_n(\alpha) - p_n(\alpha_n^c)]^m.$ 

It then follows from (4.12) and (4.15) that (4.11) holds if 

(4.16) $\sum_{\alpha \in \mathcal{A}_n^c,\ \alpha \ne \alpha_n^c} \frac{1}{\lambda_n^{2m}\,[p_n(\alpha) - p_n(\alpha_n^c)]^m} \to 0.$ 

Thus we finally have the following. 

Theorem 4.1. Under conditions (4.1), (4.2), (4.5), (4.6), (4.14) and (4.16), 

(4.17) $P[\hat\alpha_n = \alpha_n^c] \to 1.$ 

It is proved in the Appendix that under (4.1) and (4.2) 

(4.18) $\max_{\alpha \in \mathcal{A}_n - \mathcal{A}_n^c} \frac{L_n(\alpha_n^c)}{L_n(\alpha)} \to_p 0.$ 

Since $L_n(\alpha_n^c) \le L_n(\alpha)$ $\forall \alpha \in \mathcal{A}_n^c$, Theorem 4.1 and (4.18) imply the following. 
Theorem 4.2. Under the conditions of Theorem 4.1, one has 

(4.19) $L_n(\hat\alpha_n)/L_n(\alpha_n^L) \to_p 1.$ 



5. Concluding remarks 

In this article we have studied predictive optimality of a cross-validatory Bayesian 
approach to model selection in the context of selecting from among a set of linear 
models. It has been shown that this method predicts as well as the oracle as the 
sample size grows. In addition, it has been shown that in case the space of candidate 
models contains at least one correct model, this method chooses the correct model 
with the smallest dimension with probability tending to one as sample size grows. 
Thus the method has two important facets - one of an optimal predictor and the 
other of a selection criterion which does not unnecessarily choose a complex model 
when simpler ones are apt. 

Needless to say, this article has not addressed some interesting related issues. 
First, it will be interesting to see how this method works when it is applied in the 
setup of generalized linear models, through theoretical investigation and simula- 
tion. Another focus of recent research is the case when the number of potential 
parameters in the models is very large, e.g., when it is of the same order as the 
number of observations. Asymptotic optimality studies in such setup, even for the 
normal linear models will be a really challenging task. Also, we have not touched 
upon the computational aspect of this method, which becomes important if the 
number of potential regressors and number of models in the model space get large. 
We, however, emphasize that one rarely considers the set of all $2^p$ possible models 
if $p$ regressors are available. For example, one can use expert knowledge about 
the problem under study and start with a pruned list of models or one can take 
a nested sequence of models (thereby restricting the total number of models to 
at most p). Li ([12], Example 1) considered a situation where the p regressors are 
arranged in decreasing order of importance. He then considered p models, the a-th 
model consisting of the first a regressors in this ordered arrangement. See in this 
context Examples 1 and 2 of [16] where the number of models under consideration 
is fixed although the number of parameters may grow with sample size. Last but 
not least, as we commented before, the requirement that $k/n \to 1$ is only a 
sufficient condition; a careful study of the necessity of this condition is in order. In 
some examples, we have observed that $k/n \to c$ for any $c \in (0, 1)$ is also sufficient to 
achieve good optimality results similar to ones we have obtained in this paper. Some 
theoretical investigations and simulation studies will hopefully prove conclusive to 
find the optimal k. It is worth mentioning that in a related problem Chakrabarti 
and Ghosh [5] made interesting observations regarding this issue which can be a 
starting point for such investigation. 
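
To illustrate the remark above on the size of the model space, the following sketch contrasts the set of all nonempty subsets of $p$ regressors with a nested sequence in which the $a$-th model uses the first $a$ regressors. The function names are hypothetical and the regressors are simply labelled $0, \ldots, p-1$.

```python
from itertools import combinations

def all_subset_models(p):
    """Every nonempty subset of the p regressors: 2**p - 1 candidate models."""
    return [c for size in range(1, p + 1) for c in combinations(range(p), size)]

def nested_models(p):
    """Nested sequence: the a-th model uses the first a regressors (p models)."""
    return [tuple(range(a)) for a in range(1, p + 1)]
```

With 20 regressors the first list has over a million entries while the second has 20, which is why a pruned or nested list keeps the computation manageable.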



Appendix 

We present in this section proofs of some of the results of the earlier sections. We 
will need bounds for the moments of linear and quadratic forms in $e$. Let $A = (a_{ij})$ 
be a non-random $n \times n$ matrix and $b$ be a non-random $n$-vector. Then by Theorem 2 
of Whittle [18], 

(6.1) $E(|e'b|^{2m}) \le C_1(\|b\|^2)^m$, and 

(6.2) $E|e'Ae - E(e'Ae)|^{2m} \le C_2\Big(\sum_i \sum_j a_{ij}^2\Big)^m$ 

for some constants $C_1, C_2 > 0$ and for positive integer $m$ for which $E(e_1^{4m}) < \infty$. 
Below $\max_\alpha$ will mean maximum over $\alpha \in \mathcal{A}_n$. 

Proof of Lemma 2.1. As shown in Li ([12], p. 970), using Theorem 2 of Whittle [18] 
or inequalities (6.1) and (6.2) stated above, and condition (2.2), we have 

(6.3) $\max_\alpha \frac{|e'P(\alpha)e - \sigma^2 p_n(\alpha)|}{nR_n(\alpha)} \to_p 0$, and 

(6.4) $\max_\alpha \frac{|e'(I - P(\alpha))\mu|}{nR_n(\alpha)} \to_p 0.$ 

a nR n (a) 
Also, from (6.3) 

(6.5) $\max_\alpha \left|\frac{L_n(\alpha)}{R_n(\alpha)} - 1\right| = \max_\alpha \frac{|e'P(\alpha)e - \sigma^2 p_n(\alpha)|}{nR_n(\alpha)} \to_p 0.$ 

Lemma 2.1 now follows from (2.3), (2.4), (6.3), (6.4) and (6.5). □ 
Proof of Lemma 2.2. Let 

$T_1 = \frac{1}{r}\sum_{i=1}^{r} \mu_i'(I - P_i(\alpha))\mu_i, \quad T_2 = \frac{1}{r}\sum_{i=1}^{r} e_i'P_i(\alpha)e_i, \quad T_3 = \frac{1}{r}\sum_{i=1}^{r} e_i'(I - P_i(\alpha))\mu_i.$ 

Then, in view of (2.5), the left hand side of the equality claimed in Lemma 2.2 can 
be written as 

$\frac{n-k}{n^2}e'e + \frac{1}{n}(T_1 - T_2 + 2T_3).$ 

We shall prove that 

(6.6) $T_j/n = o_p(L_n(\alpha))$ uniformly in $\alpha$ 

for $j = 1, 2, 3$. 

We fix a training sample $y_1 = (y_1, y_2, \ldots, y_{n-k})'$. Let 

$X(\alpha) = \begin{pmatrix} X_1 \\ X_{1c} \end{pmatrix}$ and $I - P(\alpha) = \begin{pmatrix} A \\ B \end{pmatrix},$ 

where $X_1$ and $X_{1c}$ are the submatrices consisting of the first $n - k$ rows and the 
last $k$ rows of $X(\alpha)$, respectively, and $A$ and $B$ are analogous submatrices of $I - P(\alpha)$. 
Then 

(6.7) $\mu'(I - P(\alpha))\mu = \mu'B'B\mu + \mu'A'A\mu$, and 

(6.8) $\mu'(I - P(\alpha))\mu - \mu_1'(I - P_1(\alpha))\mu_1 = \mu'B'(I - P_c)^{-1}B\mu,$ 

where $P_c = X_{1c}(X'(\alpha)X(\alpha))^{-1}X_{1c}'$ (see, e.g., Result (5.4) of Chatterjee and Hadi 
[6], p. 189). One can now check that $(I - P_c)^{-1} = I + X_{1c}(X_1'X_1)^{-1}X_{1c}'$ and 

(6.9) $\mu'B'(I - P_c)^{-1}B\mu - \mu'B'B\mu = \mu'B'X_{1c}(X_1'X_1)^{-1}X_{1c}'B\mu \ge 0$ 
as $(X_1'X_1)^{-1}$ is positive definite. From (6.7)-(6.9) 

$\frac{\mu_1'(I - P_1(\alpha))\mu_1}{nL_n(\alpha)} \le \frac{\mu'A'A\mu}{\mu'(I - P(\alpha))\mu} = \frac{\|A\mu\|^2}{\|A\mu\|^2 + \|B\mu\|^2}.$ 

We now consider average over the $r$ training samples. Since each $y_i$ ($1 \le i \le n$) 
appears in the same number of training samples, we have 

(6.10) $\frac{T_1}{nL_n(\alpha)} \le \frac{\frac{1}{r}\sum_{i=1}^{r} \mu_i'(I - P_i(\alpha))\mu_i}{\mu'(I - P(\alpha))\mu} \le \frac{n-k}{n},$ 

which converges to zero. 

To prove (6.6) for j = 2 we note that T2 can be expressed as e'M(a)e for some 
matrix M(a) = (rriij), which is a sum of r matrices corresponding to the r choices 
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of the n — k indices from {1, 2, . . . , n} (n — k rows of X(a)). For example, for the 
training sample y 1 = [y\, . . . , y n -k)\ e[Pi(a)ei may be written as e'Mi(a)e where 

Mi (a) = ( Pl ^ a) thus M(a) = (1/r) £jlf 4 (a). 

As Pj(a)'s are all idempotent matrices, one can show that ^ ^ mfj < p n (a). Then 

» 3 

proceeding as in the proof of (6.3) given in Li ([12], p. 970) one can prove the result 
using (6.2), (2.2), (2.3) and (6.5). Indeed, by (6.2), 



P 



\e'M(a)e — a 2 p n (a)\ 
max ^71 > 6 



[nR n {a)f m 

for some constant C > 0. The result follows from (2.2), (2.3) and (6.5). 

The proof of (6.6) for $j = 3$ is similar. We note that $T_3 = e'b$ with $b = (1/r)\sum_{i=1}^{r} (I - P_i(a))\mu_i$ and $\|b\|^2 \le (1/r)\sum_{i=1}^{r} \mu_i'(I - P_i(a))\mu_i$. By (6.1) and (6.10),



$$P\left[\max_a \frac{|e'b|}{nR_n(a)} > \epsilon\right] \le C\,\epsilon^{-2m} \sum_a [nR_n(a)]^{-m}$$



for some constant C > 0. The result follows from (2.2) and (6.5). Thus (6.6) is 
proved and hence the lemma. □ 

Remark 6.1. Indeed, to prove Lemma 2.1 and Lemma 2.2, we need to assume

$$\frac{p_n}{\min_a nR_n(a)} \to 0$$

instead of the stronger condition (2.3). We, however, need (2.3) to prove our final result.

Proof of Theorem 2.1. Since (2.3) and (2.7) imply that the third term on the right-hand side of (2.1) is of the order $o_p(L_n(a))$ uniformly in $a \in \mathcal{A}_n$, part (a) follows from (2.1), Lemma 2.1 and Lemma 2.2. From part (a), $T(a)$ can be written as

$$T(a) = \frac{k}{n^2}\, e'e + L_n(a)(1 + \zeta_n(a)), \quad a \in \mathcal{A}_n,$$

where $\max_a |\zeta_n(a)| \xrightarrow{p} 0$. Now $T(a_n) \le T(a)$ for all $a$ implies

$$\frac{L_n(a_n)}{L_n(a)} \le \frac{1 + \zeta_n(a)}{1 + \zeta_n(a_n)} \le \frac{1 + \max_a |\zeta_n(a)|}{1 - \max_a |\zeta_n(a)|}.$$

Part (b) follows from the above. □ 
Proof of Lemma 3.1. We first note that under suitable conditions there exist $0 < \delta < \Delta$ such that

$$\log(1 + \delta) < u_n(a) < \log(1 + \Delta) \quad \forall a \tag{6.11}$$

with probability tending to 1. This follows from (3.3), (3.4) and the fact that $e'e/n\sigma^2 \to 1$, noting that $\max_a e'P(a)e/n \le e'Pe/n \xrightarrow{p} 0$ and $L_n(a)$ is uniformly (in $a$) bounded with probability tending to 1. Here $P$ is the projection matrix corresponding to the full model.

Consider now the expression in (3.2). By Lemma 2.1 of Section 2 and (6.11),

$$\begin{aligned}
\log[S(a)/n\sigma^2] &= \log[e'e/n\sigma^2 + L_n(a)/\sigma^2 + o_p(L_n(a)/\sigma^2)] \\
&= \log[e'e/n\sigma^2 + L_n(a)/\sigma^2 + o_p(e'e/n\sigma^2 + L_n(a)/\sigma^2)] \\
&= \log[e^{u_n(a)}(1 + o_p(1))] \\
&= u_n(a) + o_p(1) \\
&= u_n(a) + o_p(u_n(a))
\end{aligned} \tag{6.12}$$

uniformly in $a$. In view of (3.5), to prove (3.7), it remains to show
n — k 



(6.13) 



J2log[S l (a)/na 2 ]=o p (l). 



Note that we are also using (3.4) and (6.11). Since $S_i(a) \ge S_i$ for all $a$ and all $i$, we have for all $a$

o<~ir io g [s,(a)] = login ^(«)] i/r ^ lQ g[- E 



implying 



log(?iCT 2 ) < 



< 



i=l 



- k 



- k 



S l {a) 



E lo S 



- k 



log (no- 2 ). 



Then (6.13) follows from Lemma 2.2 of Section 2, condition (3.4) and the fact that $L_n(a)$ is uniformly (in $a$) bounded with probability tending to 1 (as noted earlier in the argument for (6.11)). □

Proof of Theorem 3.1. Let a n be the model which minimizes L(a). Proceeding as 
in the proof of part (b) of Theorem 2.1, and using (3.7) we can prove that 



This, together with (6.11), implies that



1. 



u n (a n ) - u n (a£) A 
e'e 4- nL^.(a„) n 



Since 
$L_n(a_n)$



1. 



□ 



Proof of Proposition 4.1. We first prove equation (4.7). Below, by $\max_a$ we mean the maximum over $a \in \mathcal{A}_n^c$. Let $Z_n(a) = (e'P(a)e)/(\sigma^2 p_n(a))$. We first show that $\max_a |Z_n(a)| = O_p(1)$. By (6.2),

$$P\left[\max_a |Z_n(a) - 1| > M\right] \le \sum_a E|Z_n(a) - 1|^{2m}/M^{2m} \le \frac{C}{M^{2m}} \sum_a (p_n(a))^{-m}$$



for some constant $C > 0$, and by (4.5) this can be made arbitrarily small by choosing a suitable $M > 0$. Thus $\max_a |Z_n(a) - 1| = O_p(1)$, implying $\max_a |Z_n(a)| = O_p(1)$. This implies $(1/n)\,e'P(a)e = o_p\big(\tfrac{1}{n}\lambda_n \sigma^2 p_n(a)\big)$ uniformly in $a \in \mathcal{A}_n^c$ as $\lambda_n \to \infty$.

Proceeding in a similar manner and noting that $(1/r)\sum_{i=1}^{r} e_i'P_i(a)e_i$ can be written as $e'M(a)e$ (see proof of Lemma 2.2) one can prove

$$\frac{1}{nr}\sum_{i=1}^{r} e_i'P_i(a)e_i = o_p\Big(\frac{1}{n}\lambda_n \sigma^2 p_n(a)\Big) \quad \text{uniformly in } a \in \mathcal{A}_n^c.$$

The result now follows from (4.2), (4.4) and (4.6).

In order to complete the proof of Proposition 4.1, we now prove equation (4.8). From (4.7),

$$\Gamma(a_n^c) = \frac{k}{n^2}\, e'e + \frac{1}{n}\lambda_n \sigma^2 p_n(a_n^c) + o_p\Big(\frac{1}{n}\lambda_n \sigma^2 p_n(a_n^c)\Big).$$

The result follows from (4.1) and (4.2), noting that (4.1) implies (6.5) with $\max_a$ replaced by $\max_{a \in \mathcal{A}_n - \mathcal{A}_n^c}$. □

Proof of (4.18). Note that 



MO 



e'P«)e 



ae*4„-.A^ L n (a) a nL n (a) 
By (6.2) and by arguments used earlier 

e'P{a c n )e-cr 2 p n (a c n ) 



max 

a£A n -AS, 



< c 



nR n (a) 



> e 



Pn 



min nR n (a) 



E 



aeA„ -A: 

for some constant C. The result follows from (4.1) and (4.2) 



□ 